iNZight, Surveys, and the IDI

Tom Elliott

Te Rourou Tātaritanga
Victoria University of Wellington

tomelliott.co.nz

Updates

PhD thesis

  • submitted 9th April
  • defended 1st August
  • graduation tomorrow!

PhD thesis (TL;DR)

  • predicting buses is hard.
  • use real-time traffic data from other buses to predict upcoming ones …
    • point estimates
    • interval estimates
  • \(\mathbb{P}\)(catch bus | I arrive by), \(\mathbb{P}\)(bus arrives before 9am)
  • useful for probabilistic journey planning
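
A minimal sketch of how probabilities like these could be read off a predictive distribution, assuming (hypothetically) that the arrival-time forecast is available as a set of simulated draws. The normal draws, the chosen times, and all variable names are made up for illustration and are not taken from the thesis.

```r
set.seed(42)

# Hypothetical predictive sample: bus arrival time at my stop,
# in minutes after 8:00am (made-up numbers, not thesis output).
arrival <- rnorm(5000, mean = 55, sd = 4)

# Point estimate and 90% interval estimate of the arrival time
point_est <- median(arrival)
interval_est <- quantile(arrival, c(0.05, 0.95))

# P(bus arrives before 9am): share of draws earlier than minute 60
p_before_9 <- mean(arrival < 60)

# P(catch bus | I reach the stop at 8:50): share of draws in which
# the bus has not yet come by minute 50 (an assumed arrival time)
my_arrival <- 50
p_catch <- mean(arrival >= my_arrival)

round(c(before_9am = p_before_9, catch_bus = p_catch), 2)
```

A journey planner could compare `p_catch` across candidate departure times instead of relying on a single predicted arrival time.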

If you’re interested … tomelliott.co.nz/phd

Postdoc @ VUW @ UoA

  • MBIE Endeavour grant

    • Colin Simpson (VUW), Barry Milne (COMPASS), Andrew Sporle

    • Informatics for Social Services and Wellbeing …

    • more later!

  • Honorary position here (thanks James)

iNZight

library(iNZight)
iNZight()

iNZight main window

iNZight

  • my side-project since 2013/14

  • shifting focus as audience has evolved

iNZight

Before 2015

  • schools
  • some university

iNZight

2015–2019

  • education (school/university/MOOC)
  • unexpected places
    • data journalism
    • wildlife manager in Canada

iNZight

Recently

  • democratisation

    See Chris Wild’s talks featuring hits like We Will Plot You

  • rapid research development (Andrew Sporle)

    for organisations/groups with low/no money/time/both

iNZight

  • recent focus on surveys — now handled natively!

    • plots
    • summaries (tables of counts)
    • inference / modelling
    • data wrangling …
  • key goal is removal of barriers

Surveys and iNZight

Data

GUI

Explore

Export results/code

What if data is from a survey?

In R:

some_data <- read.csv('path/to/data')

some_svy <- survey::svydesign(ids = ~something,
    weights = ~FXWT3, # or another variable with 'WT' in it ...
    strata = ~does_it_have_any,
    fpc = ~i_dont_know_what_this_means,
    data = some_data # the design needs to know where the variables live
)

iNZight isn’t much better … or is it?!

Specify survey design

(Remember survey variables never have nice names)

mysurvey.zip

  • mysurvey.csv
  • mysurvey.svydesign

Demo

iNZight main window

mysurvey.svydesign

data = "mysurvey.csv"
weights = "wt0"
repweights = "^w[0-4]"
reptype = "JK1"

  • accessible
  • quickly open and explore
  • business as usual
    • plots
    • summaries/inference (population counts)
    • data wrangling
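
One way the fields in a specification like this could map onto the survey package, shown as a sketch and not as iNZight's actual loading code. The data frame is simulated to stand in for mysurvey.csv, and the `spec` list is a hypothetical in-memory copy of the file's fields.

```r
library(survey)

# Simulated stand-in for mysurvey.csv: a sampling weight `wt0` plus
# five replicate-weight columns w0..w4 (the ones "^w[0-4]" matches).
set.seed(1)
n <- 50
mysurvey <- data.frame(y = rnorm(n, mean = 10), wt0 = rep(20, n))
grp <- rep(0:4, length.out = n)
for (g in 0:4) {
  # delete-one-group jackknife: drop group g, rescale the rest
  mysurvey[[paste0("w", g)]] <- ifelse(grp == g, 0, 20 * 5 / 4)
}

# Hypothetical in-memory version of the specification fields
spec <- list(weights = "wt0", repweights = "^w[0-4]", reptype = "JK1")

# svrepdesign() accepts a regular expression for `repweights`
des <- svrepdesign(
  data = mysurvey,
  weights = as.formula(paste("~", spec$weights)),
  repweights = spec$repweights,
  type = spec$reptype
)

est <- svymean(~y, des)
est
```

Keeping the design in a small text file next to the data means the person opening the survey never has to know which columns are weights.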

(A few) Details

iNZight’s package collection

9+ iNZight* packages

  • iNZight (GUI interface, collects user input, displays results)

  • iNZightModules (UI for time series, regression, maps, …)

  • iNZightPlots (graphs, summaries, inference)

  • iNZightTools (utility functions, data wrangling)

  • iNZightTS (time series)

  • iNZightMR (multiple response)

  • iNZightRegression (model summaries, residual plots)

  • iNZightMaps (lat/lng points, fill-in-the-shapefile maps)

  • plus vit and some others …

  • wrapper functions make programming GUIs easier

    • inputs \(\equiv\) arguments
  • packages don’t need GUI

    • iNZightPlots::inzplot()

    • simple functions aimed towards novice coders

  • returns the R code

GUI \(\rightarrow\) high-level functions \(\rightarrow\) lower-level functions (e.g., ggplot)
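
The idea of a wrapper whose arguments mirror the GUI inputs and which hands back its own code can be sketched in a few lines of base R. This is a hypothetical illustration, not the iNZightTools implementation: the function name `filter_numeric_sketch` and the use of a `"code"` attribute are assumptions.

```r
# Hypothetical wrapper: arguments correspond one-to-one to GUI inputs
# (a data set, a variable, a comparison operator, a value).
filter_numeric_sketch <- function(data, var, op, value) {
  stopifnot(op %in% c("<", "<=", ">", ">=", "==", "!="))
  data_name <- deparse(substitute(data))

  # Build the lower-level call as text from the inputs ...
  code <- sprintf("subset(%s, %s %s %s)", data_name, var, op, value)

  # ... evaluate it where the data lives ...
  result <- eval(parse(text = code), envir = parent.frame())

  # ... and keep the code alongside the result.
  attr(result, "code") <- code
  result
}

out <- filter_numeric_sketch(iris, "Sepal.Width", "<", 3.5)
attr(out, "code")
```

Because the wrapper itself does not depend on the GUI, the same function serves both the point-and-click interface and a novice writing a script.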

An example: Filtering data

library(iNZightTools)
iris_filtered <- filterNumeric(iris, "Sepal.Width", "<", 3.5)
head(iris_filtered)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          4.9         3.0          1.4         0.2  setosa
## 2          4.7         3.2          1.3         0.2  setosa
## 3          4.6         3.1          1.5         0.2  setosa
## 4          4.6         3.4          1.4         0.3  setosa
## 5          5.0         3.4          1.5         0.2  setosa
## 6          4.4         2.9          1.4         0.2  setosa
code(iris_filtered)
## [1] "iris %>% dplyr::filter(Sepal.Width < 3.5)"

A slightly more complex example

iris_agg <- aggregateData(iris, "Species", c("median", "var"), "Sepal.Length")
head(iris_agg)
## # A tibble: 3 x 3
##   Species    Sepal.Length_median Sepal.Length_var
##   <fct>                    <dbl>            <dbl>
## 1 setosa                     5              0.124
## 2 versicolor                 5.9            0.266
## 3 virginica                  6.5            0.404
code(iris_agg)
## [1] "iris %>% dplyr::group_by(Species) %>% dplyr::summarize(Sepal.Length_median = median(Sepal.Length, "
## [2] "    na.rm = TRUE), Sepal.Length_var = var(Sepal.Length, na.rm = TRUE), "                           
## [3] "    .groups = \"drop\")"

What about surveys?

  • modified wrapper functions to handle surveys

  • refactored GUI to pass around a ‘data-thing’ (data or survey)

library(survey)
data(api, package = "survey")
dclus2 <- svydesign(id = ~dnum+snum,
    fpc = ~fpc1+fpc2,
    data = apiclus2
)
dclus2_filtered <- filterNumeric(dclus2, "api99", ">=", 700)
code(dclus2_filtered)
## [1] "dclus2 %>% srvyr::as_survey() %>% srvyr::filter(api99 >= 700)"

Big thanks to the ‘srvyr’ package!
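
For readers who want to run the generated pipeline by hand, the following sketch continues it with a design-based summary using srvyr directly. The choice of `api00` as the variable to summarise and the column name `mean_api00` are illustrative assumptions, not something the slides specify.

```r
library(survey)
library(srvyr)

data(api, package = "survey")
dclus2 <- svydesign(id = ~dnum + snum,
    fpc = ~fpc1 + fpc2,
    data = apiclus2
)

# Same steps as the generated code, followed by a survey-weighted mean
res <- dclus2 %>%
  srvyr::as_survey() %>%
  srvyr::filter(api99 >= 700) %>%
  srvyr::summarise(mean_api00 = srvyr::survey_mean(api00))

res
```

The result is a one-row tibble holding the estimate and its standard error, both computed with the sampling design taken into account.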

Te Rourou Tātaritanga

How does this all relate to my postdoc?

Rourou = basket

Nā tō rourou, nā taku rourou, ka ora ai te iwi.

(With your food basket and my food basket the people will thrive.)

Tātaritanga = analysis


Te Rourou Tātaritanga

“Tools for analytics and sharing data for the betterment of communities.”


Or: “Informatics for Social Services and Wellbeing”

Primary goals

  1. Improve data standards

  2. Promote Māori data sovereignty

  3. Develop systems to support access

  4. Evaluate synthesising of datasets

  5. Security and privacy implications

  6. Machine learning and AI methods

terourou.org

The Integrated Data Infrastructure (IDI)

  • database connecting data across NZ’s sectors

  • high security environment

  • but also other unnecessary barriers: coding!

iNZight to the rescue!

  • high school and/or university

  • no coding necessary

  • easy to learn and relearn

  • iNZight in Stats NZ data lab …? Watch this space!

iNZight in the Data Lab (WIP)

Start confined to (example) small data sets …

  • primary researcher: SQL \(\Rightarrow\) CSV

  • non-coding researchers: graphs, tables, …

… and build from there!

iNZight outside the Data Lab

  • iwi groups, Pacific nations, etc. with specific needs

    • population summaries (tables of counts)

    • regression models

    • demographic models …

  • easy to learn and relearn

    • repeat analyses after 6 months / 2 years

    • no (or low) (re)training or consultation costs

  • produces R code script

Bayesian demography

Why limit yourself to tables when you can fit hierarchical Bayesian models with model-specific priors, likelihoods, … ?

  • John Bryant’s R packages (dembase, demest, …) for Bayesian demography

  • R coding required (and data transformations, working with multi-dimensional arrays, …)

  • so we tested out iNZight’s new add-on system …

DEMO

Other projects

Both work and ‘fun’

IDI Search App

  • to get access to the IDI, you need to put together a research proposal

  • putting together a research proposal requires knowing what data is available to investigate

  • that data is hidden away in the IDI

IDI Search App

  • simple web app (ReactJS)

  • searchable database (schema, table name, variable name, descriptions where available)

  • prospective (and current) IDI researchers can explore what’s available

terourou.org/idisearch

DEMO

idi-search.web.app

Bus display v2

  • the display in 302 was broken

  • rebuilt it (again) using ReactJS + d3

  • uses newly available real-time occupancy

DEMO

tomelliott.co.nz/bus-display

Lots of ReactJS …

  • long-term goal: a prototype of iNZight built with ReactJS and R-serve

  • a single app for Windows / macOS / Linux / web

  • connecting to local/remote R server (user permissions, firewall, etc.)

NO DEMO

Thank you

Github: tmelliott | iNZightVIT | terourou

Twitter: @tomelliottnz | @iNZightUoA | @terourou

tomelliott.co.nz | inzight.nz | terourou.org